Introduction to Open Data Science - Course Project

About the project

This is my course diary for the IODS course. Feeling good. Hoping to learn some R.

Here is my github project: https://github.com/ukkelinkamu/IODS-project

Thoughts on chapters 1 and 4

I’m somewhat familiar with R and ggplot, so this was not too difficult. Still, I had to think a bit before I could get exercise 4.6.5 done (below). I also liked the final graphs in chapter 4; I might borrow some ideas from them for my own work. R Markdown seems like a great way to approach this.

date()
## [1] "Mon Dec 11 13:39:27 2023"

Chapter 2 - Regression

We have learned the basics of fitting regression models and summarising data.

#I created a csv file in the previous task so I start by reading it
df <- read.csv("a_learn14.csv")
df <- df[,-1]
#exploring the structure and dimensions
str(df)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
dim(df)
## [1] 166   7
#There are 166 rows and 7 columns. The data come from a student survey. The survey questions were averaged into the columns attitude, deep, stra and surf, where deep, stra and surf correspond to the three question categories. There are also columns for age and gender.

#access the GGally and ggplot2 libraries
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)


pairs(df[-1])

summary(df)
##     gender               Age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :1.400   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :3.200   Median :3.667  
##                     Mean   :25.51   Mean   :3.143   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       stra            surf           Points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00
p <- ggpairs(df, lower = list(combo = wrap("facethist", bins = 20)))
p

#Respondents are young (mean age 25.5), and about two thirds are female. Exam points correlate most strongly with the attitude score.
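These impressions can be checked numerically; a small sketch, assuming df as read above:

```r
# check the gender split and the correlation behind the claims above
prop.table(table(df$gender))   # share of F vs M respondents
cor(df$attitude, df$Points)    # attitude vs exam points
```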

# fit a linear model
my_model <- lm(Points ~ attitude+stra+surf, df)

# print out a summary of the model
my_model
## 
## Call:
## lm(formula = Points ~ attitude + stra + surf, data = df)
## 
## Coefficients:
## (Intercept)     attitude         stra         surf  
##     11.0171       3.3952       0.8531      -0.5861
summary(my_model)
## 
## Call:
## lm(formula = Points ~ attitude + stra + surf, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08
#attitude is strongly and statistically significantly associated with points: the regression coefficient is 3.3952 with p-value 1.93e-08. The other explanatory variables are not statistically significant.

#new model without the explanatory variables that were not statistically significant
my_model <- lm(Points ~ attitude, df)

#multiple R-squared of the attitude-only model is 0.1906, which is fairly low: attitude explains only about 19% of the variance in exam points, so the data points lie far from the regression line on average. Predicting an individual student's points from attitude alone would therefore be imprecise.
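The R-squared can be pulled straight from the model summary, and the model used for a rough prediction; a sketch where the attitude value 4.0 is illustrative, not from the data:

```r
# extract R-squared of the attitude-only model directly
summary(my_model)$r.squared

# predicted exam points for a hypothetical student with attitude 4.0
predict(my_model, newdata = data.frame(attitude = 4.0))
```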

#diagnostic plots
plot(my_model, which = c(1,2,5)) 

#the diagnostic plots suggest the model assumptions hold: the residuals are evenly scattered and the red trend line shows no large curves.

date()
## [1] "Mon Dec 11 13:39:32 2023"

Here we go again…


Chapter 3 - Logistic regression

We have learned the basics of logistic regression.

First, I read the data file. It concerns student performance and alcohol use, with 370 respondents and 36 variables on student characteristics and test results.

df <- read.csv("chap3.csv")
colnames(df)
##  [1] "X"          "school"     "sex"        "age"        "address"   
##  [6] "famsize"    "Pstatus"    "Medu"       "Fedu"       "Mjob"      
## [11] "Fjob"       "reason"     "guardian"   "traveltime" "studytime" 
## [16] "schoolsup"  "famsup"     "activities" "nursery"    "higher"    
## [21] "internet"   "romantic"   "famrel"     "freetime"   "goout"     
## [26] "Dalc"       "Walc"       "health"     "failures"   "paid"      
## [31] "absences"   "G1"         "G2"         "G3"         "alc_use"   
## [36] "high_use"

My hypothesis is that absences, failures, higher age and male sex are associated with high alcohol use.

Participants are 15-22 years old. The distribution of absences is right-skewed: most students have only a few. Most students also have 0 failures.

Judging from the graphs, sex, absences and age seem to be associated with high alcohol use. Failures are rare, which makes them harder to assess from these plots.
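The visual impressions can also be checked with a cross-tabulation and group means; a sketch, assuming df as read above:

```r
# cross-tabulate high alcohol use by sex
table(sex = df$sex, high_use = df$high_use)

# mean absences, age and failures among high vs non-high users
aggregate(cbind(absences, age, failures) ~ high_use, data = df, FUN = mean)
```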

library(ggplot2)
library(tidyr)

#defining interesting variables
var <- c("failures", "absences", "age", "sex")
summary(df)
##        X             school              sex                 age       
##  Min.   :  1.00   Length:370         Length:370         Min.   :15.00  
##  1st Qu.: 93.25   Class :character   Class :character   1st Qu.:16.00  
##  Median :185.50   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :185.50                                         Mean   :16.58  
##  3rd Qu.:277.75                                         3rd Qu.:17.00  
##  Max.   :370.00                                         Max.   :22.00  
##    address            famsize            Pstatus               Medu    
##  Length:370         Length:370         Length:370         Min.   :0.0  
##  Class :character   Class :character   Class :character   1st Qu.:2.0  
##  Mode  :character   Mode  :character   Mode  :character   Median :3.0  
##                                                           Mean   :2.8  
##                                                           3rd Qu.:4.0  
##                                                           Max.   :4.0  
##       Fedu           Mjob               Fjob              reason         
##  Min.   :0.000   Length:370         Length:370         Length:370        
##  1st Qu.:2.000   Class :character   Class :character   Class :character  
##  Median :3.000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2.557                                                           
##  3rd Qu.:3.750                                                           
##  Max.   :4.000                                                           
##    guardian           traveltime      studytime      schoolsup        
##  Length:370         Min.   :1.000   Min.   :1.000   Length:370        
##  Class :character   1st Qu.:1.000   1st Qu.:1.000   Class :character  
##  Mode  :character   Median :1.000   Median :2.000   Mode  :character  
##                     Mean   :1.446   Mean   :2.043                     
##                     3rd Qu.:2.000   3rd Qu.:2.000                     
##                     Max.   :4.000   Max.   :4.000                     
##     famsup           activities          nursery             higher         
##  Length:370         Length:370         Length:370         Length:370        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    internet           romantic             famrel         freetime    
##  Length:370         Length:370         Min.   :1.000   Min.   :1.000  
##  Class :character   Class :character   1st Qu.:4.000   1st Qu.:3.000  
##  Mode  :character   Mode  :character   Median :4.000   Median :3.000  
##                                        Mean   :3.935   Mean   :3.224  
##                                        3rd Qu.:5.000   3rd Qu.:4.000  
##                                        Max.   :5.000   Max.   :5.000  
##      goout            Dalc            Walc           health     
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:3.000  
##  Median :3.000   Median :1.000   Median :2.000   Median :4.000  
##  Mean   :3.116   Mean   :1.484   Mean   :2.295   Mean   :3.562  
##  3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:5.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##     failures          paid              absences            G1       
##  Min.   :0.0000   Length:370         Min.   : 0.000   Min.   : 2.00  
##  1st Qu.:0.0000   Class :character   1st Qu.: 1.000   1st Qu.:10.00  
##  Median :0.0000   Mode  :character   Median : 3.000   Median :12.00  
##  Mean   :0.1892                      Mean   : 4.511   Mean   :11.52  
##  3rd Qu.:0.0000                      3rd Qu.: 6.000   3rd Qu.:14.00  
##  Max.   :3.0000                      Max.   :45.000   Max.   :18.00  
##        G2             G3           alc_use       high_use      
##  Min.   : 4.0   Min.   : 0.00   Min.   :1.000   Mode :logical  
##  1st Qu.:10.0   1st Qu.:10.00   1st Qu.:1.000   FALSE:259      
##  Median :12.0   Median :12.00   Median :1.500   TRUE :111      
##  Mean   :11.5   Mean   :11.52   Mean   :1.889                  
##  3rd Qu.:14.0   3rd Qu.:14.00   3rd Qu.:2.500                  
##  Max.   :18.0   Max.   :18.00   Max.   :5.000
#drawing graphs of the variables of interest
gather(df[,var]) %>% ggplot(aes(value)) + geom_bar() + facet_wrap("key", scales = "free")

#exploring associations with alcohol use
g1 <- ggplot(df, aes(x = high_use, col = sex))
g2 <- ggplot(df, aes(x = high_use, y = absences))
g3 <- ggplot(df, aes(x = high_use, y = age))
g4 <- ggplot(df, aes(x = high_use, y = failures))

g1 + geom_bar()

g2 + geom_boxplot()

g3 + geom_boxplot()

g4 + geom_boxplot()

Next, I define the model with the chosen variables.

The summary shows that all the chosen variables except age are statistically significantly associated with high alcohol use. The same can be read from the odds ratios and coefficients. The strongest association is with sex: males have higher odds of high alcohol use.

#defining the model
model <- glm(high_use ~ absences + sex + age + failures, data = df, family = "binomial")

summary(model)
## 
## Call:
## glm(formula = high_use ~ absences + sex + age + failures, family = "binomial", 
##     data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1832  -0.8265  -0.6036   1.0200   2.1041  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.72535    1.74828  -2.131 0.033100 *  
## absences     0.09019    0.02334   3.864 0.000111 ***
## sexM         1.00045    0.24754   4.042 5.31e-05 ***
## age          0.10850    0.10505   1.033 0.301694    
## failures     0.56451    0.21098   2.676 0.007459 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 405.92  on 365  degrees of freedom
## AIC: 415.92
## 
## Number of Fisher Scoring iterations: 4
coef(model)
## (Intercept)    absences        sexM         age    failures 
## -3.72534682  0.09019383  1.00045285  0.10849680  0.56450896
# compute odds ratios (OR)
OR <- coef(model) %>% exp

# compute confidence intervals (CI)
CI <- confint(model) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                     OR        2.5 %    97.5 %
## (Intercept) 0.02410474 0.0007493589 0.7203136
## absences    1.09438639 1.0477584017 1.1483784
## sexM        2.71951309 1.6839636172 4.4530085
## age         1.11460134 0.9076445049 1.3713869
## failures    1.75858403 1.1692521450 2.6916472
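As a sanity check, an odds ratio is simply the exponentiated coefficient, and the model can produce a probability for a made-up student; a sketch where the covariate values are illustrative, not from the data:

```r
# odds ratio for male sex, recomputed by hand from the coefficient
exp(coef(model)["sexM"])

# predicted probability of high use for a hypothetical 17-year-old male
# with 5 absences and no failures (illustrative values)
predict(model,
        newdata = data.frame(absences = 5, sex = "M", age = 17, failures = 0),
        type = "response")
```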

Next, I calculate the probability of high alcohol use from the refitted model. Comparing the predictions with the true values shows mixed performance: overall the model classifies about 77% of students correctly (training error (78 + 7)/370 ≈ 0.23), which beats the 0.30 error of simply predicting that nobody is a high user, but it detects only 33 of the 111 actual high users.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#new model with the significant variables
model <- glm(high_use ~ absences + sex + failures, data = df, family = "binomial")

# predict() the probability of high_use
probabilities <- predict(model, type = "response")

df <- mutate(df, probability = predict(model, type = "response"))
df <- mutate(df, prediction = probability > 0.5)

# initialize a plot of 'high_use' versus 'probability'
g <- ggplot(df, aes(x = high_use, y = probability))

# define the geom as points and draw the plot
g + geom_point()

# tabulate the target variable versus the predictions
table(high_use = df$high_use, prediction = df$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   252    7
##    TRUE     78   33
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# calling loss_func with prob = 0 evaluates the trivial guess that nobody is a
# high user; its error (0.30) is simply the share of high users in the data
loss_func(class = df$high_use, prob = 0)
## [1] 0.3
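Calling the same loss function with the model's predicted probabilities gives the training error of the fitted model itself; a sketch using the probability column computed above:

```r
# training error of the fitted model; by the confusion table above
# this equals (78 + 7) / 370, i.e. about 0.23
loss_func(class = df$high_use, prob = df$probability)
```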

Chapter 4 - Clustering and classification

First, I loaded the Boston dataset, which contains housing values in suburbs of Boston together with characteristics of the suburbs, such as crime rate, age of the buildings and average number of rooms.

The dataset has 506 rows and 14 columns.

# access the MASS package
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# load the data
data("Boston")

#structure and dimensions
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

Next, I take a summary and a few graphical looks at the data…

Distance from employment centres (dis) has strong negative correlations with many variables, especially indus, nox and age. Positive correlations are more scattered.

Plotting the median price (medv) against other variables shows that, of the variables plotted, the number of rooms (rm) unsurprisingly correlates most strongly with price.

#summary
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) 

# print the correlation matrix
cor_matrix
##                crim          zn       indus         chas         nox
## crim     1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## zn      -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus    0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## chas    -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## nox      0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## rm      -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## age      0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## dis     -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad      0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## tax      0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## ptratio  0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## black   -0.38506394  0.17552032 -0.35697654  0.048788485 -0.38005064
## lstat    0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
## medv    -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077
##                  rm         age         dis          rad         tax    ptratio
## crim    -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431  0.2899456
## zn       0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus   -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018  0.3832476
## chas     0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox     -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320  0.1889327
## rm       1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783 -0.3555015
## age     -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559  0.2615150
## dis      0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad     -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819  0.4647412
## tax     -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000  0.4608530
## ptratio -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304  1.0000000
## black    0.12806864 -0.27353398  0.29151167 -0.444412816 -0.44180801 -0.1773833
## lstat   -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341  0.3740443
## medv     0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593 -0.5077867
##               black      lstat       medv
## crim    -0.38506394  0.4556215 -0.3883046
## zn       0.17552032 -0.4129946  0.3604453
## indus   -0.35697654  0.6037997 -0.4837252
## chas     0.04878848 -0.0539293  0.1752602
## nox     -0.38005064  0.5908789 -0.4273208
## rm       0.12806864 -0.6138083  0.6953599
## age     -0.27353398  0.6023385 -0.3769546
## dis      0.29151167 -0.4969958  0.2499287
## rad     -0.44441282  0.4886763 -0.3816262
## tax     -0.44180801  0.5439934 -0.4685359
## ptratio -0.17738330  0.3740443 -0.5077867
## black    1.00000000 -0.3660869  0.3334608
## lstat   -0.36608690  1.0000000 -0.7376627
## medv     0.33346082 -0.7376627  1.0000000
# visualize the correlation matrix
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor_matrix, method="circle")

#also exploring distributions in some of the key variables
library(ggplot2)
library(tidyr)
var <- c("crim","age","rm","dis","medv")
df <- Boston
gather(df[,var]) %>% ggplot(aes(value)) + geom_bar() + facet_wrap("key", scales = "free")

#exploring associations with the median home value (medv)
g1 <- ggplot(df, aes(x = medv, y = age))
g2 <- ggplot(df, aes(x = medv, y = crim))
g3 <- ggplot(df, aes(x = medv, y = rm))
g4 <- ggplot(df, aes(x = medv, y = dis))

g1 + geom_point()

g2 + geom_count()

g3 + geom_point()

g4 + geom_point()

Next, I scale (standardize) the dataset and create a categorical variable based on quantiles of the crime rate. Scaling subtracts each variable’s mean and divides by its standard deviation, so each value expresses the distance from the mean in standard deviations.

The variables are now distributed around zero, since they measure the difference from the average rather than absolute values.

Lastly, I split the rows into a training set (a random 80%) and a test set (the remaining 20%).

#scaling
s_df <- scale(df)

summary(s_df)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
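What scale() does can be verified by hand for a single column; a sketch using age:

```r
# scale() subtracts the column mean and divides by the standard deviation
z_age <- (Boston$age - mean(Boston$age)) / sd(Boston$age)
all.equal(as.numeric(s_df[, "age"]), z_age)   # TRUE if the two agree
```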
#scale() returns a matrix, so convert it to a data frame
s_df <- as.data.frame(scale(Boston))

#creating the categorical variable - first find the quantiles
bins <- quantile(s_df$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(s_df$crim, breaks = bins, include.lowest = TRUE)

#adding the variable to the dataset
s_df <- dplyr::select(s_df, -crim)
s_df <- data.frame(s_df, crime)

#renaming the categories to low, med_low, med_high and high
s_df$crime1 <- ifelse(s_df$crime == "[-0.419,-0.411]","low",ifelse(s_df$crime == "(-0.411,-0.39]","med_low",
       ifelse(s_df$crime == "(-0.39,0.00739]","med_high","high")))
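The nested ifelse() above can be written more compactly by relabelling the factor levels directly, since cut() orders its levels from lowest to highest; an equivalent sketch (producing a factor rather than a character vector):

```r
# equivalent to the nested ifelse(): relabel the ordered cut() levels
s_df$crime1 <- factor(s_df$crime, labels = c("low", "med_low", "med_high", "high"))
```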

#choosing random 80%
ind <- sample(506,  size = 506 * 0.8)

#train dataset including 80% of rows
train <- s_df[ind,]

# create test set with remaining
test <- s_df[-ind,]
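The split above is random and unseeded, so it changes on every knit; setting a seed first makes it reproducible. A sketch, where the seed value is arbitrary:

```r
set.seed(2023)                        # arbitrary seed for reproducibility
ind <- sample(nrow(s_df), size = nrow(s_df) * 0.8)
train <- s_df[ind, ]
test  <- s_df[-ind, ]
```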

Next, I performed a linear discriminant analysis.

Rad (index of accessibility to radial highways) has the largest coefficients on both the first and second discriminants, and so contributes most to them. The first linear discriminant accounts for about 95% of the between-group variance (proportion of trace), so it does most of the work in separating the categories.

The biplot shows the same phenomenon: on the x-axis, LD1 (the first discriminant) clearly separates the categories, and the arrow for rad is the longest, indicating it has the most influence on the separation.

#formula for the analysis
colnames(train)
##  [1] "zn"      "indus"   "chas"    "nox"     "rm"      "age"     "dis"    
##  [8] "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"    "crime"  
## [15] "crime1"
lda.fit <- lda(crime ~ zn + indus + chas + nox + rm + age + dis + 
                 rad + tax + ptratio + black + lstat + medv, data = train)
lda.fit
## Call:
## lda(crime ~ zn + indus + chas + nox + rm + age + dis + rad + 
##     tax + ptratio + black + lstat + medv, data = train)
## 
## Prior probabilities of groups:
## [-0.419,-0.411]  (-0.411,-0.39] (-0.39,0.00739]  (0.00739,9.92] 
##       0.2549505       0.2425743       0.2574257       0.2450495 
## 
## Group means:
##                         zn      indus        chas        nox          rm
## [-0.419,-0.411]  1.0163736 -0.8885843 -0.11943197 -0.8890101  0.40394961
## (-0.411,-0.39]  -0.1278160 -0.3222216 -0.07145661 -0.5596345 -0.17298546
## (-0.39,0.00739] -0.3866438  0.2558808  0.21980846  0.4303870  0.08250778
## (0.00739,9.92]  -0.4872402  1.0149946 -0.03371693  1.0572965 -0.47757670
##                        age        dis        rad        tax     ptratio
## [-0.419,-0.411] -0.9403249  0.9084867 -0.6863918 -0.7178521 -0.41894352
## (-0.411,-0.39]  -0.3466774  0.3301839 -0.5435787 -0.5117410 -0.09540841
## (-0.39,0.00739]  0.4278798 -0.3912372 -0.4197849 -0.2834153 -0.25305084
## (0.00739,9.92]   0.7787188 -0.8391287  1.6596029  1.5294129  0.80577843
##                       black       lstat        medv
## [-0.419,-0.411]  0.37656652 -0.77644152  0.49300127
## (-0.411,-0.39]   0.31791413 -0.13053310 -0.01410717
## (-0.39,0.00739]  0.03709248  0.01820649  0.15691409
## (0.00739,9.92]  -0.74112383  0.85280978 -0.64144898
## 
## Coefficients of linear discriminants:
##                  LD1         LD2         LD3
## zn       0.112281617  0.65955925 -1.01720763
## indus    0.046554090 -0.23237937  0.17453319
## chas    -0.068330251 -0.09313893  0.04560553
## nox      0.229932955 -0.78579773 -1.35855537
## rm      -0.151941724 -0.12171536 -0.24712574
## age      0.257458308 -0.31695461 -0.05714234
## dis     -0.123951142 -0.25492166  0.11076816
## rad      3.440568631  1.03322531  0.25171683
## tax      0.009648052 -0.07146369  0.25722303
## ptratio  0.135580115 -0.01150357 -0.43735711
## black   -0.136497949  0.02398717  0.14336826
## lstat    0.163378390 -0.28908112  0.28255371
## medv     0.198842140 -0.41231128 -0.26618555
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9464 0.0410 0.0126
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale = 1)
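The proportion-of-trace figures quoted above can be recomputed from the singular values stored in the fitted lda object; a sketch:

```r
# proportion of between-group variance captured by each discriminant
lda.fit$svd^2 / sum(lda.fit$svd^2)
```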

Next, I test how well the model predicts the crime categories. The correct categories were already saved in the crime1 variable, so I do not need to save them again here.

All high-crime areas were predicted correctly, and the med_low category was also predicted well; the other two categories fared worse.

#I already saved the variable to crime1 before
#removing the crime variable
test <- dplyr::select(test, -crime)

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
lda.pred
## $class
##   [1] (-0.411,-0.39]  (-0.411,-0.39]  (-0.39,0.00739] (-0.39,0.00739]
##   [5] (-0.411,-0.39]  [-0.419,-0.411] (-0.411,-0.39]  (-0.411,-0.39] 
##   [9] (-0.411,-0.39]  (-0.411,-0.39]  (-0.411,-0.39]  (-0.411,-0.39] 
##  [13] [-0.419,-0.411] (-0.411,-0.39]  (-0.411,-0.39]  (-0.39,0.00739]
##  [17] (-0.39,0.00739] (-0.39,0.00739] (-0.411,-0.39]  (-0.411,-0.39] 
##  [21] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
##  [25] (-0.39,0.00739] (-0.39,0.00739] (-0.411,-0.39]  (-0.411,-0.39] 
##  [29] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
##  [33] (-0.411,-0.39]  (-0.411,-0.39]  (-0.411,-0.39]  (-0.411,-0.39] 
##  [37] (-0.39,0.00739] (-0.39,0.00739] (-0.411,-0.39]  (-0.39,0.00739]
##  [41] (-0.39,0.00739] [-0.419,-0.411] (-0.411,-0.39]  (-0.411,-0.39] 
##  [45] (-0.411,-0.39]  (-0.411,-0.39]  [-0.419,-0.411] (-0.39,0.00739]
##  [49] (-0.39,0.00739] (-0.39,0.00739] [-0.419,-0.411] (-0.411,-0.39] 
##  [53] [-0.419,-0.411] [-0.419,-0.411] (-0.411,-0.39]  [-0.419,-0.411]
##  [57] [-0.419,-0.411] (-0.411,-0.39]  (-0.411,-0.39]  (-0.411,-0.39] 
##  [61] (-0.411,-0.39]  (-0.411,-0.39]  [-0.419,-0.411] (-0.411,-0.39] 
##  [65] [-0.419,-0.411] (-0.411,-0.39]  [-0.419,-0.411] [-0.419,-0.411]
##  [69] (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92] 
##  [73] (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92] 
##  [77] (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92] 
##  [81] (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92] 
##  [85] (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92] 
##  [89] (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92] 
##  [93] (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92] 
##  [97] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [101] (-0.39,0.00739] (-0.39,0.00739]
## Levels: [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
## 
## $posterior
##     [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
## 6      3.415147e-01   5.935579e-01    6.492744e-02   1.340458e-21
## 8      2.294675e-02   4.928168e-01    4.842365e-01   3.374273e-16
## 22     7.741513e-02   4.014770e-01    5.211079e-01   6.362517e-17
## 23     6.263573e-02   4.559235e-01    4.814408e-01   5.742044e-17
## 45     2.454011e-01   7.383210e-01    1.627793e-02   2.115320e-22
## 56     9.953927e-01   3.881981e-03    7.253431e-04   1.940465e-20
## 65     4.674795e-01   4.685550e-01    6.396551e-02   2.950828e-22
## 70     3.452631e-01   6.470159e-01    7.720948e-03   4.222389e-21
## 71     3.135011e-01   6.787834e-01    7.715573e-03   4.528157e-22
## 74     2.880272e-01   7.047996e-01    7.173140e-03   6.492172e-22
## 89     3.974872e-01   5.125633e-01    8.994953e-02   2.350173e-22
## 91     3.741189e-01   5.763610e-01    4.952015e-02   2.605435e-22
## 93     5.851190e-01   3.687635e-01    4.611749e-02   4.092397e-19
## 98     3.900779e-01   4.451796e-01    1.647425e-01   9.383788e-23
## 100    4.164040e-01   5.190130e-01    6.458303e-02   5.132045e-23
## 101    8.059273e-02   3.117294e-01    6.076778e-01   5.802136e-16
## 105    7.722707e-02   4.584813e-01    4.642916e-01   2.235536e-15
## 109    6.979037e-02   4.054396e-01    5.247700e-01   1.850857e-15
## 115    6.016527e-02   5.586779e-01    3.811569e-01   6.748134e-15
## 117    5.215625e-02   5.595723e-01    3.882714e-01   4.830952e-15
## 123    2.196949e-02   2.517184e-01    7.263121e-01   6.772675e-19
## 129    7.752139e-03   6.970617e-02    9.225417e-01   4.651681e-16
## 143    5.406357e-05   2.538118e-03    9.974078e-01   3.777594e-15
## 144    5.842574e-05   1.982898e-03    9.979587e-01   4.809069e-14
## 148    5.042189e-05   2.669287e-03    9.972803e-01   1.621895e-13
## 171    1.111302e-02   3.664513e-01    6.224357e-01   2.163626e-15
## 177    1.769301e-01   7.416789e-01    8.139098e-02   3.723918e-18
## 186    1.817382e-01   6.505295e-01    1.677322e-01   1.453386e-19
## 190    8.275910e-01   1.566270e-01    1.578199e-02   3.211171e-19
## 202    9.899623e-01   9.905916e-03    1.318310e-04   1.682446e-24
## 203    9.955019e-01   4.176812e-03    3.213067e-04   1.941131e-25
## 205    9.958545e-01   2.543232e-03    1.602259e-03   6.486924e-21
## 209    6.678850e-02   7.685462e-01    1.646653e-01   1.397203e-19
## 212    1.549640e-02   7.875999e-01    1.969037e-01   8.388023e-18
## 217    4.487934e-02   6.755518e-01    2.795689e-01   4.868487e-18
## 219    1.261190e-02   4.967698e-01    4.906183e-01   9.851816e-17
## 224    3.674177e-02   4.180067e-01    5.452515e-01   1.897253e-12
## 228    4.155971e-02   3.440186e-01    6.144217e-01   8.798241e-13
## 230    1.933277e-01   6.182832e-01    1.883890e-01   2.229892e-14
## 232    5.044068e-02   3.412891e-01    6.082702e-01   2.794593e-13
## 238    5.907567e-02   3.712407e-01    5.696836e-01   1.291552e-13
## 239    6.656810e-01   3.269925e-01    7.326500e-03   6.142693e-19
## 242    3.127241e-01   6.560004e-01    3.127551e-02   9.041857e-17
## 243    3.752587e-01   5.940010e-01    3.074022e-02   2.290106e-17
## 246    8.139849e-02   8.256501e-01    9.295140e-02   3.322418e-14
## 249    2.681731e-01   6.436664e-01    8.816041e-02   1.253964e-15
## 251    5.277318e-01   4.510993e-01    2.116891e-02   2.366502e-17
## 258    1.427054e-02   8.267458e-03    9.774620e-01   3.617225e-17
## 267    4.537491e-02   8.045334e-02    8.741718e-01   4.447768e-16
## 268    6.475447e-02   6.238779e-02    8.728577e-01   2.090502e-17
## 274    5.899907e-01   2.760679e-01    1.339414e-01   2.735015e-22
## 281    3.671798e-01   3.946099e-01    2.382103e-01   2.015368e-18
## 282    5.207789e-01   4.432095e-01    3.601160e-02   1.129943e-19
## 286    9.523437e-01   4.746874e-02    1.875337e-04   3.360524e-27
## 298    3.117581e-02   9.488873e-01    1.993688e-02   2.107370e-20
## 306    6.181484e-01   2.467095e-01    1.351421e-01   5.752393e-14
## 307    5.521633e-01   1.730366e-01    2.748001e-01   6.367792e-14
## 309    1.907165e-01   4.413061e-01    3.679774e-01   1.590655e-18
## 313    8.982266e-02   5.387228e-01    3.714545e-01   2.197643e-17
## 317    4.875802e-02   5.920085e-01    3.592334e-01   1.911876e-17
## 323    2.308400e-01   6.727829e-01    9.637710e-02   4.488557e-18
## 324    1.038083e-01   7.363605e-01    1.598312e-01   8.477375e-17
## 333    8.776257e-01   1.213803e-01    9.940770e-04   2.612947e-26
## 337    2.578908e-01   5.992810e-01    1.428282e-01   8.868653e-18
## 342    8.716369e-01   1.214222e-01    6.940936e-03   4.840294e-26
## 346    1.778123e-01   8.066802e-01    1.550750e-02   1.049464e-22
## 353    9.381368e-01   6.113021e-02    7.330175e-04   1.642421e-22
## 356    9.938894e-01   5.732705e-03    3.779450e-04   1.598365e-21
## 373    5.871092e-22   1.565019e-18    3.311061e-14   1.000000e+00
## 376    1.074323e-18   7.727281e-16    8.993652e-13   1.000000e+00
## 378    6.374323e-20   1.222753e-16    1.270061e-13   1.000000e+00
## 386    5.432352e-22   3.728041e-18    2.920217e-15   1.000000e+00
## 390    3.808832e-21   1.072531e-17    1.054298e-14   1.000000e+00
## 399    1.357478e-21   8.558663e-18    5.065810e-15   1.000000e+00
## 401    1.035770e-20   3.662835e-17    2.517011e-14   1.000000e+00
## 406    1.354651e-20   3.524724e-17    1.880527e-14   1.000000e+00
## 409    8.801032e-22   7.254620e-18    2.137810e-15   1.000000e+00
## 412    5.177426e-22   1.599451e-18    1.610409e-15   1.000000e+00
## 414    4.519492e-22   2.345874e-18    6.604927e-16   1.000000e+00
## 417    8.484241e-22   1.791049e-18    4.101712e-15   1.000000e+00
## 421    1.021559e-20   1.130415e-17    4.717060e-14   1.000000e+00
## 428    7.165423e-21   5.488031e-18    7.797905e-15   1.000000e+00
## 431    1.587739e-20   3.275321e-17    9.172562e-15   1.000000e+00
## 432    1.252249e-20   2.969521e-17    1.249061e-14   1.000000e+00
## 435    2.028959e-21   2.002038e-18    7.723915e-15   1.000000e+00
## 437    5.860635e-22   5.367276e-19    4.420327e-15   1.000000e+00
## 447    6.401314e-21   7.994719e-18    4.413807e-14   1.000000e+00
## 450    9.547125e-21   1.520965e-17    4.467088e-14   1.000000e+00
## 454    1.333891e-19   1.353897e-16    7.089496e-13   1.000000e+00
## 457    2.367026e-22   3.568941e-19    1.817851e-15   1.000000e+00
## 460    5.635002e-20   6.871138e-17    1.718054e-13   1.000000e+00
## 463    1.437042e-19   1.429380e-16    3.501304e-13   1.000000e+00
## 475    5.016288e-20   2.027825e-16    1.953482e-14   1.000000e+00
## 476    1.995229e-20   1.077959e-16    1.901657e-14   1.000000e+00
## 477    2.310025e-19   5.841359e-16    1.835539e-13   1.000000e+00
## 484    2.485149e-16   3.296629e-13    4.977201e-12   1.000000e+00
## 489    4.987598e-03   2.539640e-01    7.410484e-01   1.551178e-15
## 490    4.183949e-03   3.352381e-01    6.605779e-01   3.684271e-15
## 496    6.476228e-02   4.641939e-01    4.710439e-01   7.008098e-15
## 498    4.448427e-02   3.681579e-01    5.873578e-01   2.402264e-14
## 504    2.763678e-01   1.829549e-01    5.406774e-01   8.665412e-22
## 505    2.938760e-01   2.159274e-01    4.901965e-01   8.763529e-22
## 
## $x
##            LD1         LD2         LD3
## 6   -3.1923558 -0.19235611  0.55757233
## 8   -1.7017992 -1.11573040  0.89525879
## 22  -1.9460934 -0.82246127  0.06102509
## 23  -1.9492054 -0.87653956  0.31952688
## 45  -3.3582353  0.18853735  1.50189171
## 56  -2.6700815  2.66807938 -2.80351207
## 65  -3.3685120 -0.19952400  0.19035785
## 70  -3.0220483  0.96362366  1.50060935
## 71  -3.2671900  0.72187712  1.60252851
## 74  -3.2223808  0.75712289  1.70913410
## 89  -3.3959409 -0.45191037  0.21715857
## 91  -3.3725529 -0.17277859  0.60085127
## 93  -2.5592283  0.71043712 -0.03758029
## 98  -3.5051516 -0.84114012 -0.14828303
## 100 -3.5610255 -0.40959614  0.34632271
## 101 -1.6962837 -0.68096654 -0.26433524
## 105 -1.5521191 -0.44108647  0.21942070
## 109 -1.5659412 -0.55910670  0.10796820
## 115 -1.4188340 -0.34156728  0.60642093
## 117 -1.4489229 -0.43785891  0.67444837
## 123 -2.3751293 -1.90193079  0.15287188
## 129 -1.5510858 -1.84037024 -0.62364003
## 143 -0.9353114 -3.66866145 -1.18986080
## 144 -0.6472415 -3.40657862 -1.46703151
## 148 -0.5149545 -3.35263213 -1.12124487
## 171 -1.4511507 -1.35803857  0.88193353
## 177 -2.2878879  0.16080943  0.97781387
## 186 -2.6612851 -0.48069748  0.55819333
## 190 -2.5499969  1.34996557 -0.56532696
## 202 -3.7080274  2.65269093 -1.20205272
## 203 -3.9365380  2.01683972 -2.36041812
## 205 -2.7938626  2.17721583 -3.51555420
## 209 -2.6182811 -0.87181774  1.23097529
## 212 -2.0897842 -1.16513418  1.91125587
## 217 -2.2087457 -0.96575172  1.08485298
## 219 -1.8076749 -1.47206988  1.20726749
## 224 -0.7621965 -0.19904890  0.42291744
## 228 -0.8497482 -0.27972440  0.13380723
## 230 -1.3360576  0.57743475  0.38883978
## 232 -0.9871677 -0.30274110  0.03585694
## 238 -1.0831097 -0.27806226  0.06259134
## 239 -2.4764510  1.70249371  0.53831517
## 242 -1.9377516  1.14782622  0.95423762
## 243 -2.0962766  1.10284469  0.78147905
## 246 -1.2410032  0.61884922  1.38590687
## 249 -1.6587436  0.81767562  0.58072798
## 251 -2.0925362  1.42353119  0.50594237
## 258 -1.7911176 -1.86724244 -2.91980370
## 267 -1.6538395 -1.11646995 -1.37090278
## 268 -2.0037204 -1.25481702 -1.77578523
## 274 -3.3860582 -0.47915014 -0.72007487
## 281 -2.3952518 -0.13555998 -0.41611606
## 282 -2.6976432  0.66884409  0.29789259
## 286 -4.4627895  1.90156420  0.14373769
## 298 -2.7502141 -0.30848792  2.68723285
## 306 -1.2514162  1.28530710 -0.92075880
## 307 -1.2365966  0.89978734 -1.48229067
## 309 -2.4004042 -0.63042330 -0.15581237
## 313 -2.0756837 -0.69321689  0.39952649
## 317 -2.0615102 -0.93157739  0.81279698
## 323 -2.2813178  0.19999811  0.68197550
## 324 -1.9261573 -0.09694184  0.95534856
## 333 -4.3007391  1.23996665  0.36178867
## 337 -2.2157215  0.11244318  0.35474103
## 342 -4.2740386  0.33842795 -0.43344910
## 346 -3.4213538  0.02081844  1.77059614
## 353 -3.2990508  2.21281081 -0.21284678
## 356 -2.9479569  2.76113027 -2.16616018
## 373  6.7872050 -0.99694417 -1.00907735
## 376  6.0904690  0.38159605 -0.46621758
## 378  6.3501803  0.21677689  0.07452078
## 386  6.8124329  0.16860787  0.82829745
## 390  6.6426308  0.31388871  0.28427126
## 399  6.7215254  0.26403985  0.90291435
## 401  6.5257833  0.28726605  0.55180010
## 406  6.5192966  0.53689875  0.49879805
## 409  6.7693099  0.51557425  1.32510363
## 412  6.8588431  0.43918816  0.31485557
## 414  6.8714449  0.82460635  1.10360148
## 417  6.8081830  0.17602550 -0.21605018
## 421  6.5558029 -0.03090925 -0.78512611
## 428  6.6401602  0.71057140 -0.53385330
## 431  6.5291753  0.95248778  0.64429422
## 432  6.5386267  0.70622936  0.54829157
## 435  6.7441134  0.21125179 -0.81773147
## 437  6.8701976 -0.01160263 -1.17149204
## 447  6.5946881 -0.18486679 -0.83938078
## 450  6.5497344 -0.02999175 -0.45401352
## 454  6.2698985 -0.33505477 -0.91331957
## 457  6.9525084  0.06403242 -0.72143944
## 460  6.3712015  0.01742138 -0.51904957
## 463  6.2793127  0.04141391 -0.61194664
## 475  6.3851258  1.04351680  1.43186937
## 476  6.4576824  0.68882095  1.32957774
## 477  6.2167619  0.55190812  0.71143257
## 484  5.5433458  1.72101321  1.64966562
## 489 -1.4361078 -1.79326447  0.88174886
## 490 -1.3381426 -1.72650710  1.27202238
## 496 -1.4163680 -0.41359798  0.31066903
## 498 -1.2556558 -0.55931575  0.19325743
## 504 -3.2328031 -1.36190332 -1.28944842
## 505 -3.2387285 -1.28776934 -1.12752834
# cross tabulate the results
table(correct = test$crime1, predicted = lda.pred$class)
##           predicted
## correct    [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
##   high                   0              0               1             27
##   low                   12             11               1              0
##   med_high               0              7              14              1
##   med_low                5             15               8              0
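A cross-tabulation like the one above can be condensed into a single accuracy figure. A toy sketch with made-up vectors (not the actual Boston test split):

```r
# toy example, not the actual test data: overall accuracy from a confusion table
correct   <- factor(c("low", "low", "high", "med", "high"))
predicted <- factor(c("low", "med", "high", "med", "low"),
                    levels = levels(correct))
table(correct = correct, predicted = predicted)
accuracy <- mean(correct == predicted)   # 3 of 5 predictions are correct
accuracy
```

With the real data, the same `mean(correct == predicted)` idea works as long as the class labels of both factors are comparable.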

Next, reload and standardize the Boston dataset.

Then I try to determine the optimal number of clusters. This can be done by looking for the point where the total within-cluster sum of squares (WCSS) drops sharply as the number of clusters increases. Here that happens at two clusters.

Interpretation of the plot: again, rad separates the two clusters (the black and red dots) most clearly, and it does so in combination with several other variables; tax could be the second-best separator.

#reload and scale
data("Boston")
df <- scale(Boston)

#distances
dist_df <- dist(df)
summary(dist_df)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
#k-means
km <- kmeans(df, centers = 4)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(df, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

# k-means clustering
km <- kmeans(df, centers = 2)

# plot the Boston dataset with clusters
pairs(df, col = km$cluster)
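One thing worth noting: kmeans() starts from random centers, so both the WCSS curve and the final clustering can change between knits. A small reproducibility sketch on a built-in dataset (iris stands in for Boston here):

```r
# fix the RNG seed and use several random starts so the clustering is stable
set.seed(123)
km <- kmeans(scale(iris[, 1:4]), centers = 2, nstart = 20)
table(km$cluster)   # cluster sizes are now reproducible between runs
```

Setting the seed just before the WCSS loop above would make that plot reproducible as well.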

Bonus

First, reload the Boston dataset and scale it.

In the biplot, LD1 (the first discriminant) on the x-axis shows a clear separation of the categories, and the arrow for rad is by far the longest, so it has the most influence on the separation between categories.

# access the MASS package
library(MASS)

# load the data
data("Boston")

#scaling and forming a dataframe
scaled <- as.data.frame(scale(Boston))

# fit the LDA on the training set created in the earlier chunk
lda_df <- lda(crime ~ zn + indus + chas + nox + rm + age + dis + 
                 rad + tax + ptratio + black + lstat + medv, data = train)

# the function for arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
# target classes as numeric (crime lives in the train data, not in scaled)
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda_df, dimen = 2)
lda.arrows(lda_df, myscale = 1)




Chapter 5

First, let's set the row names to country names and explore the data and distributions.

#setting rownames to countries

df <- read.csv("human.csv")
row.names(df) <- df$Country
df <- df[,2:9]

#exploring the data and distributions
library(GGally)

ggpairs(df)

library(corrplot)
cor(df) %>% corrplot()

Maternal mortality and life expectancy are highly correlated, as are adolescent birth rate and maternal mortality. There are also other very logical correlations, such as life expectancy with expected years of schooling. The distributions of Ado.Birth, Mat.Mor, GNI, and Life.Exp are highly skewed; the other distributions look more symmetric/normal.
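The same strong pairs can also be pulled out of the correlation matrix programmatically. A self-contained sketch, with the built-in mtcars data standing in for the human data:

```r
# list variable pairs whose absolute correlation exceeds a threshold
cm <- cor(mtcars[, c("mpg", "disp", "hp", "wt")])
cm[upper.tri(cm, diag = TRUE)] <- NA              # keep each pair only once
idx <- which(abs(cm) > 0.7, arr.ind = TRUE)
strong <- data.frame(var1 = rownames(cm)[idx[, 1]],
                     var2 = colnames(cm)[idx[, 2]],
                     r    = round(cm[idx], 2))
strong
```

Running the same lines on `cor(df)` would list the pairs mentioned above instead of eyeballing the corrplot.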

Performing principal component analysis. First, the analysis on the standardized data.

#standardization
df_s <- scale(df)

# print out summaries of the standardized variables
summary(df_s)
##    Edu.Ratio        Labor.Ratio         Edu.Exp           Life.Exp      
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
# perform principal component analysis (with the SVD method)
pca_df <- prcomp(df_s)
pca_df
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
## 
## Rotation (n x k) = (8 x 8):
##                     PC1         PC2         PC3         PC4        PC5
## Edu.Ratio   -0.35664370  0.03796058 -0.24223089  0.62678110 -0.5983585
## Labor.Ratio  0.05457785  0.72432726 -0.58428770  0.06199424  0.2625067
## Edu.Exp     -0.42766720  0.13940571 -0.07340270 -0.07020294  0.1659678
## Life.Exp    -0.44372240 -0.02530473  0.10991305 -0.05834819  0.1628935
## GNI         -0.35048295  0.05060876 -0.20168779 -0.72727675 -0.4950306
## Mat.Mor      0.43697098  0.14508727 -0.12522539 -0.25170614 -0.1800657
## Ado.Birth    0.41126010  0.07708468  0.01968243  0.04986763 -0.4672068
## Parli.F     -0.08438558  0.65136866  0.72506309  0.01396293 -0.1523699
##                     PC6         PC7         PC8
## Edu.Ratio    0.17713316  0.05773644  0.16459453
## Labor.Ratio -0.03500707 -0.22729927 -0.07304568
## Edu.Exp     -0.38606919  0.77962966 -0.05415984
## Life.Exp    -0.42242796 -0.43406432  0.62737008
## GNI          0.11120305 -0.13711838 -0.16961173
## Mat.Mor      0.17370039  0.35380306  0.72193946
## Ado.Birth   -0.76056557 -0.06897064 -0.14335186
## Parli.F      0.13749772  0.00568387 -0.02306476
# draw a biplot of the principal component representation and the original variables
biplot(pca_df, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))

Mat.Mor and Life.Exp have the largest loadings on the first principal component, which explains most of the variance. Labor.Ratio and Parli.F load most strongly on the second principal component.
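The claim that PC1 explains most of the variance can be checked from the standard deviations printed above, since a component's variance is the square of its standard deviation:

```r
# share of variance explained by each PC, from the sdevs printed above
sdev <- c(2.0708380, 1.1397204, 0.8750485, 0.7788630,
          0.6619563, 0.5363061, 0.4589994, 0.3222406)
pve <- 100 * sdev^2 / sum(sdev^2)
round(pve, 1)   # PC1 explains roughly 53.6 % of the total variance
```

The same numbers are available directly via `summary(pca_df)`.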

Next, let's do the same with non-standardized data.

pca_df_n <- prcomp(df)
pca_df_n
## Standard deviations (1, .., p=8):
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
## 
## Rotation (n x k) = (8 x 8):
##                       PC1           PC2           PC3           PC4
## Edu.Ratio   -5.607472e-06  0.0006713951 -3.412027e-05 -2.736326e-04
## Labor.Ratio  2.331945e-07 -0.0002819357  5.302884e-04 -4.692578e-03
## Edu.Exp     -9.562910e-05  0.0075529759  1.427664e-02 -3.313505e-02
## Life.Exp    -2.815823e-04  0.0283150248  1.294971e-02 -6.752684e-02
## GNI         -9.999832e-01 -0.0057723054 -5.156742e-04  4.932889e-05
## Mat.Mor      5.655734e-03 -0.9916320120  1.260302e-01 -6.100534e-03
## Ado.Birth    1.233961e-03 -0.1255502723 -9.918113e-01  5.301595e-03
## Parli.F     -5.526460e-05  0.0032317269 -7.398331e-03 -9.971232e-01
##                       PC5           PC6           PC7           PC8
## Edu.Ratio   -0.0022935252  2.180183e-02  6.998623e-01  7.139410e-01
## Labor.Ratio  0.0022190154  3.264423e-02  7.132267e-01 -7.001533e-01
## Edu.Exp      0.1431180282  9.882477e-01 -3.826887e-02  7.776451e-03
## Life.Exp     0.9865644425 -1.453515e-01  5.380452e-03  2.281723e-03
## GNI         -0.0001135863 -2.711698e-05 -8.075191e-07 -1.176762e-06
## Mat.Mor      0.0266373214  1.695203e-03  1.355518e-04  8.371934e-04
## Ado.Birth    0.0188618600  1.273198e-02 -8.641234e-05 -1.707885e-04
## Parli.F     -0.0716401914 -2.309896e-02 -2.642548e-03  2.680113e-03
biplot(pca_df_n, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

summary(df)
##    Edu.Ratio       Labor.Ratio        Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

The PCA results and plots are very different. Because the data is non-standardized, GNI, which has by far the largest scale and therefore the largest variance/standard deviation, dominates the PCA.

The analysis suggests that since Edu.Ratio, Edu.Exp, Life.Exp, GNI, Mat.Mor, and Ado.Birth have similar-sized loadings on the first component, they also tend to correlate with each other: when one variable changes, so do the others, in the same or the opposite direction depending on the sign. The same holds for Parli.F and Labor.Ratio, which load most strongly on the second component.
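A small toy demonstration of this scale effect: when prcomp() is run without scaling, the column with the largest variance essentially becomes the first principal component.

```r
# toy data: one low-variance and one high-variance column
set.seed(1)
x <- data.frame(small = rnorm(100, sd = 1),
                big   = rnorm(100, sd = 1000))
p <- prcomp(x)                       # note: no scaling
abs(p$rotation["big", "PC1"])        # close to 1: PC1 is essentially 'big'
```

This is exactly what happens with GNI in the unscaled human data.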

Next, let’s work with the tea data.

#loading the data
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)

library(FactoMineR)
library(ggplot2)

#exploring the data
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36
#taking the first 10 variables
tea <- tea[,1:10]

#performing the analysis and giving the summary
mca <- MCA(tea, graph = FALSE)
summary(mca)
## 
## Call:
## MCA(X = tea, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.179   0.129   0.115   0.107   0.104   0.090   0.079
## % of var.             17.939  12.871  11.481  10.654  10.406   8.985   7.917
## Cumulative % of var.  17.939  30.810  42.291  52.945  63.351  72.337  80.253
##                        Dim.8   Dim.9  Dim.10
## Variance               0.076   0.068   0.054
## % of var.              7.559   6.793   5.395
## Cumulative % of var.  87.812  94.605 100.000
## 
## Individuals (the 10 first)
##                  Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## 1             | -0.543  0.549  0.474 | -0.462  0.552  0.342 | -0.007  0.000
## 2             | -0.543  0.549  0.474 | -0.462  0.552  0.342 | -0.007  0.000
## 3             | -0.068  0.009  0.002 |  0.542  0.761  0.141 | -0.139  0.056
## 4             | -1.037  1.998  0.558 |  0.264  0.180  0.036 | -0.018  0.001
## 5             | -0.235  0.103  0.061 | -0.011  0.000  0.000 |  0.124  0.044
## 6             | -1.037  1.998  0.558 |  0.264  0.180  0.036 | -0.018  0.001
## 7             |  0.024  0.001  0.001 | -0.404  0.422  0.374 | -0.185  0.099
## 8             | -0.193  0.069  0.054 |  0.058  0.009  0.005 | -0.382  0.423
## 9             | -0.258  0.124  0.117 | -0.624  1.007  0.680 | -0.130  0.049
## 10            | -0.048  0.004  0.002 | -0.153  0.061  0.020 | -0.176  0.090
##                 cos2  
## 1              0.000 |
## 2              0.000 |
## 3              0.009 |
## 4              0.000 |
## 5              0.017 |
## 6              0.000 |
## 7              0.078 |
## 8              0.210 |
## 9              0.030 |
## 10             0.027 |
## 
## Categories (the 10 first)
##                   Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## breakfast     |   0.209   1.168   0.040   3.471 |  -0.770  22.090   0.547
## Not.breakfast |  -0.193   1.078   0.040  -3.471 |   0.710  20.391   0.547
## Not.tea time  |  -0.680  11.253   0.358 -10.351 |   0.327   3.635   0.083
## tea time      |   0.527   8.723   0.358  10.351 |  -0.254   2.817   0.083
## evening       |   0.445   3.787   0.103   5.562 |   0.634  10.728   0.210
## Not.evening   |  -0.233   1.980   0.103  -5.562 |  -0.332   5.609   0.210
## lunch         |   0.947   7.329   0.154   6.788 |   0.167   0.317   0.005
## Not.lunch     |  -0.163   1.260   0.154  -6.788 |  -0.029   0.054   0.005
## dinner        |  -1.571   9.633   0.186  -7.454 |   1.044   5.925   0.082
## Not.dinner    |   0.118   0.725   0.186   7.454 |  -0.079   0.446   0.082
##                v.test     Dim.3     ctr    cos2  v.test  
## breakfast     -12.786 |  -0.031   0.041   0.001  -0.518 |
## Not.breakfast  12.786 |   0.029   0.038   0.001   0.518 |
## Not.tea time    4.983 |   0.235   2.102   0.043   3.579 |
## tea time       -4.983 |  -0.182   1.630   0.043  -3.579 |
## evening         7.929 |  -0.599  10.742   0.188  -7.494 |
## Not.evening    -7.929 |   0.313   5.616   0.188   7.494 |
## lunch           1.195 |  -1.133  16.410   0.221  -8.125 |
## Not.lunch      -1.195 |   0.195   2.820   0.221   8.125 |
## dinner          4.951 |  -0.090   0.050   0.001  -0.428 |
## Not.dinner     -4.951 |   0.007   0.004   0.001   0.428 |
## 
## Categorical variables (eta2)
##                 Dim.1 Dim.2 Dim.3  
## breakfast     | 0.040 0.547 0.001 |
## tea.time      | 0.358 0.083 0.043 |
## evening       | 0.103 0.210 0.188 |
## lunch         | 0.154 0.005 0.221 |
## dinner        | 0.186 0.082 0.001 |
## always        | 0.089 0.096 0.414 |
## home          | 0.008 0.114 0.005 |
## work          | 0.215 0.006 0.251 |
## tearoom       | 0.315 0.003 0.018 |
## friends       | 0.325 0.141 0.008 |
plot(mca, invisible=c("ind"), graph.type = "classic")

plot(mca, invisible=c("ind"), graph.type = c("ggplot")) + theme(panel.grid.major = element_blank(),
   plot.title=element_text(size=14, color="blue"),
   axis.title = element_text(size=12, color="red"))
## Warning: ggrepel: 3 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Tea.time, work, tearoom, and friends are strongly linked to the first dimension and as such tend to vary together. Breakfast and evening are strongly linked to the second dimension, while lunch, always, and work are strongly linked to the third dimension.
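As a convenience check of that reading, one can take the eta² values printed above and pick the variable with the largest value per dimension (the numbers below are copied from the output, so this is just a sketch):

```r
# eta^2 values for the first three MCA dimensions, copied from the summary
eta2 <- rbind(breakfast = c(0.040, 0.547, 0.001),
              tea.time  = c(0.358, 0.083, 0.043),
              evening   = c(0.103, 0.210, 0.188),
              lunch     = c(0.154, 0.005, 0.221),
              dinner    = c(0.186, 0.082, 0.001),
              always    = c(0.089, 0.096, 0.414),
              home      = c(0.008, 0.114, 0.005),
              work      = c(0.215, 0.006, 0.251),
              tearoom   = c(0.315, 0.003, 0.018),
              friends   = c(0.325, 0.141, 0.008))
colnames(eta2) <- paste0("Dim.", 1:3)
# the variable most strongly tied to each dimension
apply(eta2, 2, function(d) rownames(eta2)[which.max(d)])
```

The same table is available directly from the fitted object as `mca$var$eta2`, so hard-coding it is only for illustration here.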


Analysis of longitudinal data

Analysis of the RATS data

In the first part, I will analyse the RATS data, by providing graphical summaries and using linear mixed model.

#read the data

RATS <- read.csv("RATS.csv")

#summary of the rats data
summary(RATS)
##        X                ID            Group          days          
##  Min.   :  1.00   Min.   : 1.00   Min.   :1.00   Length:176        
##  1st Qu.: 44.75   1st Qu.: 4.75   1st Qu.:1.00   Class :character  
##  Median : 88.50   Median : 8.50   Median :1.50   Mode  :character  
##  Mean   : 88.50   Mean   : 8.50   Mean   :1.75                     
##  3rd Qu.:132.25   3rd Qu.:12.25   3rd Qu.:2.25                     
##  Max.   :176.00   Max.   :16.00   Max.   :3.00                     
##       RATS            Time      
##  Min.   :225.0   Min.   : 1.00  
##  1st Qu.:267.0   1st Qu.:15.00  
##  Median :344.5   Median :36.00  
##  Mean   :384.5   Mean   :33.55  
##  3rd Qu.:511.2   3rd Qu.:50.00  
##  Max.   :628.0   Max.   :64.00
#Access the package ggplot2
library(ggplot2)
RATS$Group <- factor(RATS$Group)
RATS$ID <- factor(RATS$ID)

# Draw the plot
ggplot(RATS, aes(x = Time, y = RATS, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(RATS$RATS), max(RATS$RATS)))

I started by loading the data. The dataset contains longitudinal information on rats' weights.

It seems that in groups 2 and 3 the starting weight is much higher than in group 1. The rate of increase looks roughly similar across the groups.

Next, I will get the group means for different timepoints and draw a plot on that data.

# Access the dplyr and tidyr packages
library(dplyr)
library(tidyr)

#standardize the data
RATSS <- RATS %>%
  group_by(Time) %>%
  mutate( stdRATS = (RATS - mean(RATS))/sd(RATS)) %>%
  ungroup()


#number of rats per group (8 in group 1, 4 in groups 2 and 3)
RATSS$n <- ifelse(RATSS$Group == 1, 8, 4)

#calculate group means and standard errors per time point
# n() counts the rats in each Group/Time cell; using it instead of the n
# column avoids the behaviour deprecated in dplyr 1.1.0
RATSS1 <- RATSS %>%
  group_by(Group, Time) %>%
  summarise(mean = mean(RATS), se = sd(RATS)/sqrt(n()), .groups = "drop")
# Plot the mean profiles
library(ggplot2)
ggplot(RATSS1, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=3) +
  scale_shape_manual(values = c(1,2,3)) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
  theme(legend.position = c(0.8,0.8)) +
  scale_y_continuous(name = "mean(RATS) +/- se(RATS)")

Group 1 has smaller standard errors than the other groups, which is expected since it has twice as many rats (8 vs 4). Otherwise this plot does not add much beyond the previous one: groups 2 and 3 probably gain weight slightly faster than group 1.

Checking the data for outliers

# Create a summary data by group and subject with mean weight as the summary variable (ignoring the baseline measurement)
RATS1 <- RATS %>%
  filter(Time > 1) %>%
  group_by(Group, ID) %>%
  summarise( mean=mean(RATS) ) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
ggplot(RATS1, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(RATS), baseline excluded")

Well, in group 2 there is one outlier, but it does not seem to be too far from the rest.

A t-test cannot be used here since there are three groups, so let's run an ANOVA instead.

# Add the baseline from the original data as a new variable to the summary data
RATS2 <- RATS1 %>%
  mutate(baseline = RATS[RATS$Time == 1,]$RATS)

# Fit the linear model with the mean as the response 
fit <- lm(mean ~ baseline + Group, data = RATS2)

anova(fit)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq   F value   Pr(>F)    
## baseline   1 253625  253625 1859.8201 1.57e-14 ***
## Group      2    879     439    3.2219  0.07586 .  
## Residuals 12   1636     136                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Baseline weight is a highly significant predictor of the later mean weight, but after adjusting for the baseline the differences between groups are not statistically significant (p = 0.076).
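To see how much the baseline adjustment matters, the same response could be modelled without the baseline term (a sketch, reusing the `RATS2` summary data from above):

```r
# Without the baseline, the Group term mostly reflects the large
# starting-weight differences between groups rather than growth
fit0 <- lm(mean ~ Group, data = RATS2)
anova(fit0)
```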

Analysis of BPRS data

First, load the data and plot the ratings per weeks.

#read the data

BPRS <- read.csv("BPRS.csv")

#summary of the BPRS data
summary(BPRS)
##        X            treatment      subject         weeks          
##  Min.   :  1.00   Min.   :1.0   Min.   : 1.00   Length:360        
##  1st Qu.: 90.75   1st Qu.:1.0   1st Qu.: 5.75   Class :character  
##  Median :180.50   Median :1.5   Median :10.50   Mode  :character  
##  Mean   :180.50   Mean   :1.5   Mean   :10.50                     
##  3rd Qu.:270.25   3rd Qu.:2.0   3rd Qu.:15.25                     
##  Max.   :360.00   Max.   :2.0   Max.   :20.00                     
##       BPRS            week  
##  Min.   :18.00   Min.   :0  
##  1st Qu.:27.00   1st Qu.:2  
##  Median :35.00   Median :4  
##  Mean   :37.66   Mean   :4  
##  3rd Qu.:43.00   3rd Qu.:6  
##  Max.   :95.00   Max.   :8
#Access the package ggplot2
library(ggplot2)
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)

# Draw the plot
ggplot(BPRS, aes(x = week, y = BPRS, linetype = subject)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(BPRS$BPRS), max(BPRS$BPRS)))

Ratings seem to decrease in both treatment groups over the weeks, and the between-subject variability also decreases.

Next let’s create the regression model.

model1 <- lm(BPRS ~ treatment + week, data = BPRS) 
model1
## 
## Call:
## lm(formula = BPRS ~ treatment + week, data = BPRS)
## 
## Coefficients:
## (Intercept)   treatment2         week  
##     46.4539       0.5722      -2.2704

The week coefficient is large and negative (-2.27 points per week), meaning ratings decrease clearly over time, while the treatment coefficient is small: treatment group 2 has slightly higher scores on average (+0.57). Note, however, that this model ignores the repeated measurements on the same subjects.

Next, the random intercept model.

# access library lme4
library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
# Create a random intercept model
BPRS_ref <- lmer(BPRS ~ week + treatment + (1 | subject), data = BPRS, REML = FALSE)

summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: BPRS ~ week + treatment + (1 | subject)
##    Data: BPRS
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000

The fixed-effect estimates are the same as in the simple linear model, but the subject-level intercept variance (47.4) shows considerable differences between subjects, and the standard errors now account for the repeated measurements. The treatment effect remains small and non-significant.

Next, the random slope model, and a comparison with the intercept model.

#random slope model
BPRS_ref1 <- lmer(BPRS ~ week + treatment + (week | subject), data = BPRS, REML = FALSE)
#comparison with anova
anova(BPRS_ref, BPRS_ref1)
## Data: BPRS
## Models:
## BPRS_ref: BPRS ~ week + treatment + (1 | subject)
## BPRS_ref1: BPRS ~ week + treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_ref     5 2748.7 2768.1 -1369.4   2738.7                       
## BPRS_ref1    7 2745.4 2772.6 -1365.7   2731.4 7.2721  2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: BPRS ~ week + treatment + (week | subject)
##    Data: BPRS
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9802   -0.51
##  Residual             97.4305  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000

The likelihood-ratio test gives p = 0.026, so the random slope model fits the data significantly better than the random intercept model: subjects differ not only in their starting level but also in how their scores change over time.
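As a sanity check, the chi-square statistic and its p-value can be reproduced by hand from the deviances printed above (the slope model has 7 - 5 = 2 extra parameters):

```r
# Likelihood-ratio test: difference of deviances, chi-square with df = 2
chisq <- 2738.7 - 2731.4
pchisq(chisq, df = 2, lower.tail = FALSE)
## about 0.026, matching the anova() output
```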

Next, let's add an interaction between time and treatment group.

BPRS_ref2 <- lmer(BPRS ~ week + treatment + (week | subject) + week*treatment, data = BPRS, REML = FALSE)

anova(BPRS_ref1, BPRS_ref2)
## Data: BPRS
## Models:
## BPRS_ref1: BPRS ~ week + treatment + (week | subject)
## BPRS_ref2: BPRS ~ week + treatment + (week | subject) + week * treatment
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_ref1    7 2745.4 2772.6 -1365.7   2731.4                       
## BPRS_ref2    8 2744.3 2775.4 -1364.1   2728.3 3.1712  1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: BPRS ~ week + treatment + (week | subject) + week * treatment
##    Data: BPRS
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0768  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.9964  8.0620        
##           week         0.9687  0.9842   -0.51
##  Residual             96.4707  9.8220        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.2521  21.262
## week             -2.6283     0.3589  -7.323
## treatment2       -2.2911     1.9090  -1.200
## week:treatment2   0.7158     0.4010   1.785
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.650              
## treatment2  -0.424  0.469       
## wek:trtmnt2  0.356 -0.559 -0.840

The interaction improves the fit only slightly, and not significantly (chi-square = 3.17, p = 0.075), so there is no strong evidence that the treatments differ in their rate of change.
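For interpretation, the fixed effects of the interaction model imply a separate average weekly slope for each treatment group (simple arithmetic on the estimates printed above):

```r
# Implied mean BPRS change per week in each treatment group
slope_trt1 <- -2.6283            # week coefficient (treatment 1)
slope_trt2 <- -2.6283 + 0.7158   # add the week:treatment2 interaction
c(treatment1 = slope_trt1, treatment2 = slope_trt2)
## treatment 1 declines about 2.63 points/week, treatment 2 about 1.91
```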

Finally, plotting the raw and the fitted values.

#raw values
ggplot(BPRS, aes(x = week, y = BPRS, group = subject)) +
  geom_line() +
  scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 2)) +
  scale_y_continuous(name = "scores") +
  theme(legend.position = "top")

Fitted <- fitted(BPRS_ref2)

BPRS$fitted <- Fitted

#with fitted values
ggplot(BPRS, aes(x = week, y = fitted, group = subject)) +
  geom_line(aes(linetype = subject)) +
  scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 2)) +
  scale_y_continuous(name = "scores") +
  theme(legend.position = "top")

The fitted values produce a clearer plot, since each line now follows the model's subject-specific trajectory instead of the noisy raw measurements. The scores decrease over time in both treatment groups.